feat(seqopt): pure-Python EA operators + DEAP parity + SeqOptPlot (protein engineering) by breimanntools · Pull Request #271 · breimanntools/aaanalysis

breimanntools · 2026-06-25T14:28:58Z

Summary

Completes the parity-first half of #261 (deferred from PR #267) and substantially extends SeqOpt (pro). Builds on ADR-0043; recorded in ADR-0045.

⚠️ #261 stays open (no closing keyword) — ongoing.

Pure-Python EA operator set (DEAP-free runtime)

Beyond the NSGA-II core: varAnd/varOr variation; (μ+λ)/(μ,λ)/eaSimple survival; constraints (DeltaPenalty / ClosestValidPenalty); uniform/one-/two-point crossover; substitution/shift mutation; single-objective Hall of Fame; a cumulative Pareto archive (rank-0 = best-ever, none lost to crowding); hypervolume / spread / convergence metrics. Objectives accept any callable(sequence) -> float (external scikit/torch model or web API), cached per variant.
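As a rough illustration of the variation operators listed above, here is a minimal pure-Python sketch of uniform crossover and substitution mutation on amino-acid strings. The function names, the `swap_prob`/`rate` parameters, and the 20-letter alphabet are assumptions made for this example; they are not taken from the SeqOpt code.

```python
import random

# Assumption: the 20 canonical amino acids as the mutation alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_crossover(seq_a, seq_b, rng, swap_prob=0.5):
    """Swap each aligned position between two equal-length parents with probability swap_prob."""
    if len(seq_a) != len(seq_b):
        raise ValueError("parents must have the same length")
    child_a, child_b = list(seq_a), list(seq_b)
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return "".join(child_a), "".join(child_b)


def substitution_mutation(seq, rng, rate=0.05):
    """Replace each residue, with probability rate, by a different amino acid."""
    mutated = []
    for aa in seq:
        if rng.random() < rate:
            aa = rng.choice([x for x in AMINO_ACIDS if x != aa])
        mutated.append(aa)
    return "".join(mutated)


rng = random.Random(0)  # seeded for reproducibility
child_a, child_b = uniform_crossover("AAAAAAAAAA", "LLLLLLLLLL", rng)
variant = substitution_mutation(child_a, rng, rate=0.2)
```

Passing an explicit `random.Random` instance keeps the operators deterministic under a seed, which is convenient when comparing runs.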

DEAP parity (dev/test-only oracle)

deap added to [dev] only. test_seqopt_deap_parity.py proves our sortNondominated/assignCrowdingDist/selNSGA2 match DEAP — identical rank (incl. ties), crowding values+ordering within atol, survivor profile. Phase-C comparison (.github/scripts/seqopt_deap_comparison.py): ours-fast is 3–7× faster than DEAP and dependency-free → ship ours. engine="exact"|"fast" give identical fronts; fast is memory-bounded (chunked, 2.6× leaner at n=3000).
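To make the "identical rank (incl. ties)" claim concrete, the following is a small reference sketch of non-dominated ranking. It assumes all objectives are maximized and uses a simple front-peeling loop; it is neither the package's `exact` nor its `fast` engine, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def nondominated_ranks(points):
    """Return the front index (0 = best) of each fitness tuple by repeatedly peeling off the current front."""
    ranks = [None] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


fitness = [(1, 1), (2, 2), (2, 2), (1, 3), (3, 1), (0, 0)]
ranks = nondominated_ranks(fitness)  # the two identical (2, 2) points share rank 0
```

Identical fitness tuples never dominate each other, so tied individuals always land in the same front; this is the property a rank-parity test can check regardless of how ties are later broken.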

Visualization (SeqOptPlot)

pareto_front (2-D/3-D), parallel_coordinates, convergence (best/mean/worst band), hypervolume, mutation_map (front substitution-enrichment heatmap), genealogy (mutational-lineage tree). cmap is a parameter throughout (package convention).

Docs / framing

The class docstring carries a DEAP-mapping table and clearly frames SeqOpt as protein engineering (ML-guided directed evolution, [Yang19]/[Wittmann21]) vs de novo design (RFdiffusion→ProteinMPNN→AlphaFold, [Yang26]). New: 8 per-method example notebooks (realistic GSEC "super-substrate" task) + tutorial7_protein_engineering.

Bugs fixed (found via realistic data)

mode="impact" kept the full df_seq_ref → NaN-tripped check_df_seq when the reference came from load_dataset; now position-cols only (+ regression test).

Verification

The 469-test broad gate is green locally (SeqOpt suite, all meta-tests, docstrings, parity); the branch has been merged with current master.

🤖 Generated with Claude Code

Pure-Python (no runtime dep), closing the gaps from the NSGA-II-only first cut:
- variation varAnd/varOr; survival mu_plus_lambda/mu_comma_lambda/ea_simple
- constraints (feasibility callables) with DeltaPenalty / ClosestValidPenalty
- single-objective Hall of Fame (SeqOpt.hall_of_fame_) beside the Pareto archive
- convergence metric (generational distance to a ref_front) in eval
- engine='exact'|'fast' (numpy-vectorized non-dominated sort; numerically identical front, faster); crowding now uses DEAP's nobj*span normalization

DEAP parity (dev/test-only oracle; runtime stays DEAP-free):
- deap added to [dev]; test_seqopt_deap_parity.py asserts our sort/crowding/selNSGA2 reproduce DEAP's sortNondominated/assignCrowdingDist/selNSGA2 on synthetic fitness (identical rank incl. ties, crowding values+ordering within atol, selNSGA2 profile)
- Phase-C comparison (.github/scripts/seqopt_deap_comparison.py): ours-exact/fast vs DEAP, correctness + wall-clock + peak memory -> ship-ours (fast beats DEAP, e.g. ~14ms vs ~102ms at 500x3, dependency-free)

Docs: ADR-XXXX (number-last), CONTEXT.md EA-operator/engine/convergence terms, release note.

85 SeqOpt tests + 447 in the broader gate green; docstrings/param coverage clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
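The constraint handling mentioned above (feasibility callables combined with a delta-style penalty) can be sketched as follows. This is a simplified illustration in the spirit of a delta penalty, assuming a maximized objective; `delta_penalty`, `toy_objective`, the mutation budget, and all other names here are hypothetical and not the package's API.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def delta_penalty(objective, is_feasible, delta, distance=None):
    """Wrap an objective: feasible sequences keep their score, infeasible ones get
    the constant delta, reduced by an optional distance-to-feasibility term."""
    def penalized(sequence):
        if is_feasible(sequence):
            return objective(sequence)
        excess = distance(sequence) if distance is not None else 0.0
        return delta - excess
    return penalized


WILD_TYPE = "LLLLLLLL"   # hypothetical reference sequence
MAX_MUTATIONS = 2        # hypothetical mutation budget


def toy_objective(seq):
    """Stand-in for a model score: fraction of alanine residues."""
    return seq.count("A") / len(seq)


def within_budget(seq):
    return hamming(WILD_TYPE, seq) <= MAX_MUTATIONS


def budget_excess(seq):
    return float(hamming(WILD_TYPE, seq) - MAX_MUTATIONS)


penalized = delta_penalty(toy_objective, within_budget, delta=0.0, distance=budget_excess)
```

Subtracting the distance term gives infeasible variants a gradient back toward the feasible region instead of a flat penalty.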

Pin what is actually invariant vs DEAP: non-dominated rank always identical (incl. ties); crowding values+ordering and selNSGA2 survivor profile identical on continuous fitness (within 1e-9); survivor rank-distribution identical under heavy ties (the exact tied-individual kept is arbitrary in DEAP too). Drops the over-strict exact-set/profile-under-duplicates claims that don't hold (boundary points tie at inf crowding even for continuous objectives).
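The remark about boundary points tying at infinite crowding can be seen in a short sketch of crowding distance with the nobj*span normalization. This is an illustrative reimplementation written for this note, not the code under test, and `crowding_distance` is a name chosen here.

```python
import math


def crowding_distance(front):
    """Crowding distance for a list of fitness tuples belonging to one front.

    Per objective, the two extreme points get infinity; interior points accumulate
    the gap between their neighbours divided by (number of objectives * span).
    """
    n = len(front)
    if n == 0:
        return []
    nobj = len(front[0])
    dist = [0.0] * n
    for k in range(nobj):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = order[0], order[-1]
        dist[lo] = dist[hi] = math.inf
        span = front[hi][k] - front[lo][k]
        if span == 0:
            continue  # degenerate objective contributes nothing
        norm = nobj * span
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (front[nxt][k] - front[prev][k]) / norm
    return dist


front = [(0.0, 4.0), (1.0, 2.0), (3.0, 1.0), (4.0, 0.0)]
distances = crowding_distance(front)
```

Here the two extreme points both receive `inf`, so any selection that truncates by crowding has to break that tie arbitrarily, which is why an exact-survivor-set assertion is too strict.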

…ence history

Visualization (SeqOptPlot): new convergence (per-generation hypervolume + spread + per-objective best, from the new SeqOpt.history_), 3-D pareto_front (optional z), and parallel_coordinates for many-objective fronts. Per-generation history is now tracked (spread + per-objective best, not only hypervolume) and exposed as SeqOpt.history_.

Objectives: a callable source now receives the variant SEQUENCE (fn(sequence) -> float) and is cached per distinct variant, so any external predictor (a scikit/torch model or a sequence-level tool / web API) can be optimized jointly with the model-on-features objectives; pure-callable multi-objective runs need no CPP model.

Two executed example notebooks (seqopt_convergence, seqopt_parallel_coordinates) demonstrate the views + the external-predictor recipe.

53 SeqOpt frontend tests (+19) and 459 in the broader gate green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
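The per-variant caching of callable objectives described above can be approximated with a memoizing wrapper. The sketch below assumes a plain `fn(sequence) -> float` predictor; `make_cached_objective` and `slow_predictor` are illustrative names, and how SeqOpt caches internally is not shown in this PR text.

```python
from functools import lru_cache


def make_cached_objective(fn):
    """Wrap fn(sequence) -> float so each distinct sequence is evaluated only once.

    Returns the cached callable plus a counter dict exposing how many real calls happened.
    """
    calls = {"n": 0}

    @lru_cache(maxsize=None)
    def cached(sequence):
        calls["n"] += 1
        return float(fn(sequence))

    return cached, calls


def slow_predictor(sequence):
    """Stand-in for an expensive external model or web API (hypothetical)."""
    return sequence.count("L") / len(sequence)


objective, calls = make_cached_objective(slow_predictor)
population = ["LLAA", "LAAA", "LLAA", "LLAA", "LAAA"]
scores = [objective(seq) for seq in population]
```

Because evolutionary runs revisit the same variants across generations, memoizing on the sequence string is what keeps a slow external predictor affordable.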

Rewrite all six SeqOpt example notebooks around a real task, 'design a super gamma-secretase substrate': load_features('DOM_GSEC') (150 CPP features) + load_dataset('DOM_GSEC') + a simple RandomForest, take a non-substrate wild-type and mutate its TMD to maximize predicted substrate probability with few mutations.

They demonstrate run (nsga2/greedy, impact/importance, varOr/ea_simple/operators, constraints + Hall of Fame, external-predictor callable objective), eval (hypervolume/spread/convergence), and all four SeqOptPlot views (pareto_front 2-D/3-D, parallel_coordinates, convergence, hypervolume), with executed outputs.

Fix (found via the realistic reference): SeqOpt mode='impact' refit kept the FULL df_seq_ref, so a reference from load_dataset (carrying jmd_n/tmd/jmd_c/label) NaN-tripped check_df_seq on the appended variant row. Now keep only the position-based columns; add a regression test with an extra-column reference.

460-test broad gate + docstrings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
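For readers unfamiliar with the hypervolume metric reported by eval, here is a minimal two-objective sketch. It assumes both objectives are maximized and that the caller supplies a reference point worse than the front; the function name and signature are illustrative, not the package's metric implementation.

```python
def hypervolume_2d(points, ref):
    """Area dominated by points and bounded below by ref (both objectives maximized)."""
    # Keep only points strictly better than the reference in both objectives.
    inside = [(f1, f2) for f1, f2 in points if f1 > ref[0] and f2 > ref[1]]
    # Sweep from the largest f1 downwards; a point adds a strip only where it
    # raises the best f2 seen so far, so dominated points contribute nothing.
    inside.sort(key=lambda p: (-p[0], -p[1]))
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in inside:
        if f2 > best_f2:
            area += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return area


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, ref=(0.0, 0.0))
```

A larger hypervolume means the front pushes further from the reference point, which is why tracking it per generation gives a single-number convergence curve.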

…hero plots

Critical-assessment improvements:
- engine='fast' non-dominated sort now computes the dominance matrix in row-chunks (adaptive block), bounding the transient to O(block*n*m) vs O(n^2*m), with ~2.6x leaner peak memory at n=3000; identical fronts (parity unchanged). Realistic pool sizes were never a problem; this makes pathologically large populations safe.
- run() keeps a cumulative non-dominated archive (DEAP ParetoFront analogue), merged into the final population so the returned rank=0 front is the best-ever set; no solution is lost to per-generation crowding truncation.
- history_ now tracks per-objective best/mean/worst per generation.

Hero plots (the genre's standard views):
- SeqOptPlot.mutation_map: position x amino-acid substitution-enrichment heatmap across the front (the directed-evolution 'which mutations won' view).
- SeqOptPlot.convergence gains the classic GA best/mean/worst fitness band.

New executed notebook seqopt_mutation_map; tests for mutation_map + the band + archive.

465-test broad gate + docstrings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
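The cumulative archive idea can be sketched in a few lines: after every generation, merge the newly evaluated variants into the archive and keep only the non-dominated ones. The dict-based representation, the maximization convention, and the name `update_archive` are assumptions for this example rather than the package's data structures.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, evaluated):
    """Merge {sequence: fitness} dicts and keep only non-dominated entries."""
    merged = dict(archive)
    merged.update(evaluated)
    return {
        seq: fit
        for seq, fit in merged.items()
        if not any(dominates(other, fit) for other in merged.values())
    }


archive = {}
generations = [
    {"AAA": (1.0, 3.0), "AAC": (3.0, 1.0)},
    {"AAD": (2.0, 2.0), "AAE": (0.0, 0.0)},  # AAA and AAC are absent from this population
]
for evaluated in generations:
    archive = update_archive(archive, evaluated)
```

After the second generation the archive still holds `AAA` and `AAC` even though they dropped out of the population, which is the "none lost to crowding" behaviour described above.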

- SeqOptPlot.genealogy: mutational-lineage tree (wild-type -> variants by accumulated mutations, linked by mutation-set containment, colored by the first objective), the directed-evolution analogue of a genealogy tree, matplotlib-only (no networkx).
- SeqOpt class docstring now carries a rendered list-table mapping every run/eval method + parameter value to its DEAP function (selNSGA2/varAnd/varOr/eaMuPlusLambda/cxUniform/DeltaPenalty/...), with the aaanalysis-only rows called out.
- New executed seqopt_genealogy notebook + tests.

273-test gate + docstrings clean.
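Linking variants by mutation-set containment can be illustrated without any plotting. In the sketch below each variant is attached to the variant whose mutation set is its largest proper subset, falling back to the wild-type; that parent-selection rule, the `A1V`-style labels, and the function names are assumptions for this example, not necessarily what SeqOptPlot.genealogy does.

```python
def mutation_set(wild_type, variant):
    """Set of substitutions such as 'A1V' (wild-type residue, 1-based position, new residue)."""
    return frozenset(
        f"{w}{i + 1}{v}"
        for i, (w, v) in enumerate(zip(wild_type, variant))
        if w != v
    )


def lineage_edges(wild_type, variants):
    """Map each variant to its parent: the variant with the largest mutation set that is
    a proper subset of its own, or the wild-type if no such variant exists."""
    sets = {v: mutation_set(wild_type, v) for v in variants}
    edges = {}
    for child, muts in sets.items():
        candidates = [p for p, p_muts in sets.items() if p_muts < muts]
        # Ties are broken by sequence string so the result is deterministic.
        edges[child] = max(candidates, key=lambda p: (len(sets[p]), p), default=wild_type)
    return edges


wild_type = "AAAA"
edges = lineage_edges(wild_type, ["VAAA", "VLAA", "VLAG", "AAAG"])
```

The resulting child-to-parent map is already a tree rooted at the wild-type, so it can be drawn level by level (by number of mutations) with plain matplotlib.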

…onsistency) pareto_front / parallel_coordinates / mutation_map / genealogy take a user-overridable cmap= (defaults unchanged), matching the CPPPlot / AAMutPlot / SeqMutPlot convention of colormap-as-parameter instead of a hardcoded name.

tutorial7_protein_design: an executed end-to-end case study — train a GSEC substrate classifier, design a 'super substrate' from a non-substrate, and read the result with every SeqOptPlot view (pareto_front 2-D/3-D, convergence, mutation_map, genealogy, parallel_coordinates) plus SHAP-guided impact mode. Wired into the Tutorials toctree under a new Protein Design section.

…) + refs Draw the paradigm distinction clearly in the SeqOpt class docstring, the tutorial, and CONTEXT.md: SeqOpt does protein *engineering* — machine-learning-guided directed evolution of an existing sequence [Yang19] — explicitly NOT de novo protein design (generating new proteins). Introduce de novo design as the contrasting paradigm via the canonical structure-first pipeline RFdiffusion [Watson23] -> ProteinMPNN [Dauparas22] -> AlphaFold [Jumper21], reviewed in [DeNovoReview26]. Add all five references to references.rst; tutorial retitled 'Protein Engineering with SeqOpt' with the distinction + hyperlinked refs; Tutorials toctree section renamed. Docstring citations resolve (0 defects); 104-test gate green.

…ce reviews Read the two provided reviews and fixed the citations: ML-guided directed evolution is Wittmann, Johnston, Wu & Arnold (2021), Curr. Opin. Struct. Biol. (not the Yang19 I had guessed); the de novo design review is Yang et al. (2026), Nature 652:1139. Sharpened the distinction in the SeqOpt docstring + tutorial + CONTEXT.md using the reviews' own framing (de novo = build new proteins from the ground up; engineering = iterative mutation/selection of an existing protein, ML learns the fitness model). Citations resolve; 64-test gate green.

…ed evolution

Add [Yang19] (Nature Methods 2019, the foundational ML-guided directed-evolution-for-protein-engineering review) alongside [Wittmann21] in the SeqOpt docstring, tutorial and CONTEXT.md. Citations resolve.

…rity # Conflicts: # docs/source/index/release_notes.rst

…erators decision Number the previously number-less parity ADR (one past the current master max 0044 = find-features protocol), set status Accepted, regenerate INDEX.

codecov · 2026-06-25T15:44:58Z

Codecov Report

❌ Patch coverage is 94.75000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.13%. Comparing base (e7e272e) to head (2178624).

Files with missing lines	Patch %	Lines
aaanalysis/protein_design_pro/_seqopt_plot.py	92.95%	1 Missing and 9 partials ⚠️
...alysis/protein_design_pro/_backend/seqopt/nsga2.py	95.52%	1 Missing and 2 partials ⚠️
...ysis/protein_design_pro/_backend/seqopt/metrics.py	81.81%	1 Missing and 1 partial ⚠️
...ysis/protein_design_pro/_backend/seqopt/penalty.py	92.85%	1 Missing and 1 partial ⚠️
...analysis/protein_design_pro/_backend/seqopt/run.py	97.33%	1 Missing and 1 partial ⚠️
aaanalysis/protein_design_pro/_seqopt.py	97.22%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #271      +/-   ##
==========================================
+ Coverage   96.10%   96.13%   +0.02%     
==========================================
  Files         175      176       +1     
  Lines       16374    16733     +359     
  Branches     2796     2863      +67     
==========================================
+ Hits        15737    16087     +350     
+ Misses        369      363       -6     
- Partials      268      283      +15

Files with missing lines	Coverage Δ
aaanalysis/_constants.py	`100.00% <100.00%> (ø)`
...ysis/protein_design_pro/_backend/seqopt/metrics.py	`89.74% <81.81%> (+13.62%)`	⬆️
...ysis/protein_design_pro/_backend/seqopt/penalty.py	`92.85% <92.85%> (ø)`
...analysis/protein_design_pro/_backend/seqopt/run.py	`94.40% <97.33%> (+1.81%)`	⬆️
aaanalysis/protein_design_pro/_seqopt.py	`88.92% <97.22%> (+2.66%)`	⬆️
...alysis/protein_design_pro/_backend/seqopt/nsga2.py	`96.93% <95.52%> (-1.05%)`	⬇️
aaanalysis/protein_design_pro/_seqopt_plot.py	`92.04% <92.95%> (+1.56%)`	⬆️

Components	Coverage Δ
cpp_core	`94.95% <ø> (ø)`


breimanntools and others added 13 commits June 25, 2026 05:42

Merge remote-tracking branch 'origin/master' into feat/seqopt-deap-pa…

20fae21

docs(adr): assign ADR-0045 to the SeqOpt DEAP-parity + pure-Python op…

2178624

breimanntools merged commit 8b88b35 into master Jun 25, 2026
17 checks passed

breimanntools deleted the feat/seqopt-deap-parity branch June 25, 2026 15:54

breimanntools mentioned this pull request Jun 25, 2026

refactor(seqopt): single NSGA-II kernel (drop engine knob) + rename tutorial to protein engineering #272

Merged

breimanntools merged 13 commits into master from feat/seqopt-deap-parity
